Advanced topic 5, 2012
Topic 5 from Advanced Topics 2012

SYMIAN: Analysis and Performance Improvement of the IT Incident Management Process

Background: Know: Recognize: IT Incident Management Process

This page presents SYMIAN, a decision support tool for the performance improvement of the incident management function in IT support organizations.

Person 1 Person 2

IV. SYMIAN: THE SYMIAN DECISION SUPPORT TOOL

Person 3 Person 4

VI. SYMIAN: ARCHITECTURE AND IMPLEMENTATION

Figure 4 shows the main components of the tool: the Configuration Interface (CI), the User Interface (UI), the Configuration Manager (CM), the Parameter Identification Module (PIM), the Simulator Core (SC), the Data Collector (DC), the Trace Analyzer (TA), the Statistics Module (SM), and the Reporting Module (RM).

More specifically, the Configuration Interface component allows users to configure the IT support organization to simulate. The User Interface component allows users to load simulation parameters from a file, change the current simulation parameters, save them to a file, and start simulations. The Configuration Manager takes care of the simulator configuration, enforcing the user-specified behaviors (e.g., with regard to the verbosity of tracing information) and simulator parameters. The Parameter Identification Module provides statistical inference functions that can determine whether the samples in a given data set are distributed according to a known random variable distribution. The Simulator Core component implements the domain-specific model. SC has three sub-components: Incident Generator (IG), Incident Response Coordinator (IRC), and Incident Processor (IP). The Data Collector component collects data from the simulation that can be post-processed to assess the performance of incident management in the modeled organization. Finally, the Statistics Module and the Reporting Module respectively provide basic statistics and reporting functions for the higher-layer components.

VII. EXPERIMENTAL RESULTS

This section presents an experimental evaluation of SYMIAN's effectiveness in the performance analysis and optimization of a real-life IT support organization. For this experiment, we used data provided by the Outsourcing Services Division of HP. The data comes from the subset of the organization serving a single enterprise customer from the financial services industry, whose name is disguised as BailUsOut.

A. Model Inference and Validation

We obtained database logs of incidents for a 6-month period, comprising data for more than 23,000 incidents. For each incident, the data carried transactional information about the arrival and departure times at each visited support group. Using SYMIAN's statistical analysis and inference functions on the transactional data, we were able to construct a reasonably accurate model of the BailUsOut IT support organization. First, we constructed the escalation matrix and derived the stochastic transition matrix by normalization. Then, we modeled each support group as a G/M/s first-come-first-served (FCFS) queue. However, the transactional data did not contain actual service times, only the aggregate of waiting plus service times. Hence, for each support group we had to estimate the mean service time parameter. Finally, to model incident inter-arrival times we used an exponential probability distribution with a rate parameter estimated from the transactional data.

Comparing the outcome of the simulation with the transactional data verified that SYMIAN could reproduce the behavior of the BailUsOut IT support organization with reasonably good fidelity. Fig. 5 shows the comparison of the (empirical) cumulative distribution functions of total incident service times; Fig. 6 that of the number of visited support groups per incident; and Fig. 7 that of the number of received incidents at each support group, for transactional data and simulation outcome.
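The model-construction steps described in this part (building the escalation matrix, normalizing it into a stochastic transition matrix, and estimating the rate of the exponential inter-arrival distribution) can be illustrated with a minimal Python sketch. It is not SYMIAN's actual code: the function names, the toy routes, the arrival timestamps, and the "CLOSED" pseudo-group marking incident resolution are all assumptions made for illustration.

```python
from collections import defaultdict


def escalation_matrix(routes):
    """Count transfers between consecutive support groups on each incident route."""
    counts = defaultdict(lambda: defaultdict(int))
    for route in routes:
        for src, dst in zip(route, route[1:]):
            counts[src][dst] += 1
    return counts


def transition_matrix(counts):
    """Normalize each row of the escalation matrix so that it sums to 1."""
    probs = {}
    for src, row in counts.items():
        total = sum(row.values())
        probs[src] = {dst: n / total for dst, n in row.items()}
    return probs


def exponential_rate(arrival_times):
    """Estimate the exponential rate parameter as 1 / (mean inter-arrival time)."""
    times = sorted(arrival_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return len(gaps) / sum(gaps)


# Hypothetical incident routes; "CLOSED" is an assumed marker for resolution.
routes = [
    ["SG1", "SG2", "CLOSED"],
    ["SG1", "SG3", "SG2", "CLOSED"],
    ["SG1", "SG2", "CLOSED"],
    ["SG2", "CLOSED"],
]
P = transition_matrix(escalation_matrix(routes))

# Hypothetical arrival timestamps (arbitrary time unit).
rate = exponential_rate([0.0, 2.0, 3.0, 7.0])
```

In this toy data, two of the three incidents leaving SG1 go to SG2, so the corresponding transition probability is 2/3. The rate estimate is the maximum-likelihood estimate for exponentially distributed gaps; the service time estimation mentioned in the text is a harder inference problem and is not covered by this sketch.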
In order to verify the accuracy of the model, we performed a statistical null hypothesis analysis of the results. We then analyzed the densities of historical and simulated service times through kernel density estimation, and verified that the match is very good in the distribution tails (see Fig. 8), although not as good for low service times. This is also confirmed by the mismatch for low service times in Fig. 5. Similar results were obtained for received incidents and hops.

By applying kernel density estimation to analyze the sojourn time at the various support groups, we discovered that support groups in the BailUsOut IT support organization have different (usually three) service priorities. However, inferring the number of operators in each support group and their allocation to incidents of different queues would be very challenging. In addition, reallocating the workforce at the level of a single support group would require considering a much larger number of parameters, significantly complicating the performance tuning task for SYMIAN users.

Person 5

B. Evaluation of Configuration Changes

This part presents an experimental evaluation of the effectiveness of SYMIAN in the performance analysis and improvement of a real IT support organization, building on the model construction and validation process. SYMIAN is used to minimize the service disruption time in the BailUsOut organization, with the constraint of preserving the current number of operators. The objectives of the performance improvement process are therefore to maximize the mean incidents closed daily (MICD) metric and to minimize the mean time to resolution (MTTR) metric.

In order to locate the performance bottlenecks, a formula was introduced to calculate the bottleneck score for each support group:
$BS_i = (FI_i + FO_i) \cdot RI_i \cdot WT_i$, where $FI_i$ and $FO_i$ are the fan-in and fan-out, $RI_i$ is the number of received incidents, and $WT_i$ is the waiting time at support group $SG_i$.

The accompanying figure plots the scores obtained with this formula. It makes clear that support groups SG22 and SG39 are the major performance bottlenecks of the organization. Moreover, the uneven distribution of the scores suggests that optimizing SG22 and SG39 might improve the performance of the BailUsOut IT support organization.

To improve the organization's performance, we tried increasing the operator efficiency, emulating an improvement in operator performance. We then ran 40 simulation rounds to assess the impact of each change. The accompanying table compares the performance metrics measured in the simulations of configuration changes to the BailUsOut IT support organization. The results provided for the MTTR and MICD metrics are mean values and 95% confidence intervals calculated over 40 simulation runs. According to the table, the MTTR improvement is 17.18% and the MICD improvement is 2.05%.

To consider the effect of the support group merging and splitting optimization options, one candidate is merging SG22 and SG50. However, this might be infeasible in practice, because SG50 is a support group shared with other IT support organizations while SG22 is a dedicated support group. Furthermore, SG22 already has the highest workload. On the other hand, the support groups with the highest workloads, such as SG22 and SG39, are the best candidates for splitting. So, in theory and according to some simulation data, merging seems feasible; in practice, however, a high-workload support group cannot be merged with other groups. As for splitting, the paper provides only limited data in support of this option. Finally, none of these performance improvement processes took the business aspect into account, which might affect their outcome.
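As an illustration of the bottleneck score defined earlier in this part, the following minimal Python sketch computes the score for a few support groups and ranks them. The group names, the per-group statistics, and the helper names are hypothetical; only the formula (fan-in plus fan-out, multiplied by received incidents and by waiting time) comes from the text, and the waiting-time unit is an assumption.

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    fan_in: int          # FI_i
    fan_out: int         # FO_i
    received: int        # RI_i: number of received incidents
    waiting_time: float  # WT_i: waiting time (assumed here to be in hours)


def bottleneck_score(g: GroupStats) -> float:
    """BS_i = (FI_i + FO_i) * RI_i * WT_i."""
    return (g.fan_in + g.fan_out) * g.received * g.waiting_time


def rank_bottlenecks(stats: dict) -> list:
    """Return group names ordered from highest to lowest bottleneck score."""
    return sorted(stats, key=lambda name: bottleneck_score(stats[name]), reverse=True)


# Hypothetical statistics; these are not the BailUsOut figures.
stats = {
    "SG_A": GroupStats(fan_in=5, fan_out=4, received=1200, waiting_time=3.5),
    "SG_B": GroupStats(fan_in=2, fan_out=1, received=300, waiting_time=1.0),
    "SG_C": GroupStats(fan_in=8, fan_out=6, received=2000, waiting_time=6.0),
}
ranking = rank_bottlenecks(stats)
```

Because the score is a product, a group that is both heavily connected and slow dominates the ranking, which matches the uneven distribution of scores reported in the text.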
In conclusion, small changes to the configuration of support groups with a large fan-in and fan-out can have a relatively large impact on the behavior of the whole system. Local optimization is expected not to be a very effective practice in IT support organizations with strongly interconnected support groups and an even distribution of workload. In that case, it is necessary to adopt different optimization strategies and to consider the relationships between the different support groups.

VIII. RELATED WORK

For a detailed review of BDIM, see [5]. Some early works in BDIM include applications to change management [6], [7], capacity management and SLA design [8], [9], [10], network security [11], and network configuration management [12]. Related are approaches to business operation analysis that aim at improving business processes through the collection of metrics and making inferences over them, with some examples in [13] and simulation methods in [14]. Diao et al.'s recent studies address the estimation of labor cost and business value of IT services from the analysis of process complexity [15], [16]. The article in [3] extensively studied the business impact of incident management strategies, using a methodology that moves from the definition of business-level objectives such as those commonly used in balanced scorecards [17]. The analysis of the incident management process and the IT support organization model in this paper are founded on [4]. For modeling IT support organizations, see [21], [22]. WISE [23] proposed what-if analysis, which represents an interesting approach to estimating the outcome of complex network management operations before putting them into practice. Queuing network-based models have been applied in a broad spectrum of research areas such as computing, communications, transportation systems, health care, manufacturing systems, and supply chain systems, as can be found in [24], [25], [26], [27], [28], and [29] respectively.

IX. CONCLUSIONS AND FUTURE WORK

This work shows that optimizing the performance of large-scale IT support organizations is very complex. This paper presented the SYMIAN tool for the performance optimization of incident management in IT support organizations. The application of SYMIAN to a real-life IT support organization demonstrates the tool's effectiveness in the performance analysis and improvement process. In the SYMIAN evaluation process, open queuing network models could reproduce the behavior of real-life IT support organizations with a very high degree of accuracy. The results call for further study, which could lead to a deeper understanding of the performance of the incident management function in IT support organizations. Moreover, the SYMIAN decision support tool may be very useful in commercial applications.

Reference: C. Bartolini, C. Stefanelli, and M. Tortonesi, "SYMIAN: Analysis and Performance Improvement of the IT Incident Management Process," IEEE Transactions on Network and Service Management, vol. 7, no. 3, Sept. 2010.

[1] United Kingdom Office of Government Commerce, "ITIL Service Delivery" and "ITIL Service Support," IT Infrastructure Library version 3, 2007.
[2] IT Governance Institute, COBIT 3rd edition, 2000, http://www.isaca.org/COBIT.htm
[3] C. Bartolini, M. Sallé, and D. Trastour, "IT service management driven by business objectives: an application to incident management," in Proc. IEEE/IFIP Network Operations and Management Symposium (NOMS 2006), Apr. 2006.
[4] G. Barash, C. Bartolini, and L. Wu, "Measuring and improving the performance of an IT support organization in managing service incidents," in Proc. 2nd IEEE Workshop on Business-driven IT Management (BDIM 2007), Munich, Germany, 2007.
[5] A. Moura, J. Sauvé, and C. Bartolini, "Research challenges of business-driven IT management," in Proc. 2nd IEEE/IFIP International Workshop on Business-Driven IT Management (BDIM 2007), Munich, Germany.
[6] A. Keller, J. Hellerstein, J. L. Wolf, K. Wu, and V. Krishnan, "The CHAMPS system: change management with planning and scheduling," in Proc. IEEE/IFIP Network Operations and Management Symposium (NOMS 2004), Apr. 2004.
[7] J. Sauvé, R. Rebouças, A. Moura, C. Bartolini, A. Boulmakoul, and D. Trastour, "Business-driven decision support for change management: planning and scheduling of changes," in Proc. DSOM 2006, Dublin, Ireland.
[8] S. Aiber, D. Gilat, A. Landau, N. Razinkov, A. Sela, and S. Wasserkrug, "Autonomic self-optimization according to business objectives," in Proc. International Conference on Autonomic Computing, 2004.
[9] D. Menascé, V. A. F. Almeida, R. Fonseca, and M. A. Mendes, "Business-oriented resource management policies for e-commerce servers," Performance Evaluation, vol. 42, Elsevier Science, 2000, pp. 223-239.
[10] J. Sauvé, F. Marques, A. Moura, M. Sampaio, J. Jornada, and E. Radziuk, "SLA design from a business perspective," in Proc. DSOM 2005.
[11] H. Wei, D. Frinke, O. Carter, et al., "Cost-benefit analysis for network intrusion detection systems," in Proc. 28th Annual Computer Security Conference, Oct. 2001.
[12] R. Boutaba, J. Xiao, and I. Aib, "CyberPlanner: a comprehensive toolkit for network service providers," in Proc. 11th IEEE/IFIP Network Operation and Management Symposium (NOMS 2008), Salvador de Bahia, Brazil.
[13] F. Casati, M. Castellanos, U. Dayal, and M. C. Shan, "A metric definition, computation, and reporting model for business operation analysis," in Proc. Advances in Database Technology - EDBT 2006, 10th International Conference on Extending Database Technology, Munich, Germany, Mar. 2006.
[14] K. Tumay, "Business process simulation," in Proc. Simulation Conference 1995, Winter Volume, Dec. 1995, pp. 55-60.
[15] Y. Diao, A. Keller, S. Parekh, and V. Marinov, "Predicting labor cost through IT management complexity metrics," in Proc. 10th IEEE/IFIP Symposium on Integrated Management (IM 2007), Munich, Germany.
[16] Y. Diao and K. Bhattacharya, "Estimating business value of IT services through process complexity analysis," in Proc. 11th IEEE/IFIP Network Operation and Management Symposium (NOMS 2008), Salvador de Bahia, Brazil.
[17] R. Kaplan and D. Norton, "The balanced scorecard: measures that drive performance," Harvard Business Review, vol. 70, no. 1, pp. 71-79, 1992.
[18] G. Bolch, S. Greiner, H. de Meer, and K. Trivedi, Queuing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications, 2nd edition. Wiley, 2006.
[19] U. N. Bhat, An Introduction to Queuing Theory. Birkhäuser, 2008.
[20] Y. B. Kim and J. Park, "New approaches for inference of unobservable queues," in Proc. 2008 Winter Simulation Conference.
[21] Q. Shao, Y. Chen, S. Tao, X. Yan, and N. Anerousis, "Efficient ticket routing by resolution sequence mining," in Proc. 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'08), Las Vegas, NV, USA, Aug. 2008.
[22] Q. Shao, Y. Chen, S. Tao, X. Yan, and N. Anerousis, "EasyTicket: a ticket routing recommendation engine for enterprise problem resolution," in Proc. 34th International Conference on Very Large Data Bases (VLDB'08), Auckland, New Zealand, Aug. 2008.
[23] M. Bin Tariq, A. Zeitoun, V. Valancius, N. Feamster, and M. Ammar, "Answering what-if deployment and configuration questions with WISE."
[24] D. Liu, Y.-D. Cao, and C.-Q. Li, "Liana: a decentralized load-dependent scheduler for performance-cost optimization of grid service," J. Supercomputing, vol. 49, no. 1, pp. 127-156, July 2009.
[25] N. Bisnik and A. Abouzeid, "Queuing network models for delay analysis of multihop wireless ad hoc networks," Ad Hoc Networks, vol. 7, no. 1, pp. 79-97, Jan. 2009.
[26] D. Mitchell and J. MacGregor Smith, "Topological network design of pedestrian networks," Transportation Research Part B: Methodological, vol. 35, no. 2, Feb. 2001.
[27] N. Koizumi, E. Kuno, and T. Smith, "Modeling patient flows using a queuing network with blocking," Health Care Management Science, vol. 8, no. 1, Feb. 2005.
[28] V. Bhaskar and P. Lallement, "A four-input three-stage queuing network approach to model an industrial system," Applied Mathematical Modelling, vol. 33, no. 8, Aug. 2009.
[29] Q. Gong, K. K. Lai, and S. Wang, "Supply chain networks: closed Jackson network models and properties," International Journal of Production Economics, vol. 113, no. 2, June 2008.
[30] P. Kvam and B. Vidakovic, Nonparametric Statistics with Applications to Science and Engineering. Wiley, 2007.
[31] C. Bartolini, C. Stefanelli, and M. Tortonesi, "Business-impact analysis and simulation of critical incidents in IT service management," in Proc. 11th IFIP/IEEE International Symposium on Integrated Network Management (IM 2009), June 2009, New York.